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Graphical models are frequently used to explore networks, such 
as genetic networks, among a set of variables. This is usually carried 
out via exploring the sparsity of the precision matrix of the variables 
under consideration. Penalized likelihood methods are often used in 
such explorations. Yet, positive-definiteness constraints of precision 
matrices make the optimization problem challenging. We introduce 
nonconcave penalties and the adaptive LASSO penalty to attenuate 
the bias problem in the network estimation. Through the local lin- 
ear approximation to the nonconcave penalty functions, the problem 
of precision matrix estimation is recast as a sequence of penalized 
likelihood problems with a weighted Li penalty and solved using the 
efficient algorithm of Friedman et al. [Bio statistics 9 (2008) 432-441]. 
Our estimation schemes are applied to two real datasets. Simulation 
experiments and asymptotic theory are used to justify our proposed 
methods. 

1. Introduction. Network modeling is often explored via estimating the 
sparse precision matrix, the inverse covariance matrix, in which off-diagonal 
elements represent the conditional covariance between the corresponding 
variables. The sparsity is often studied via penalized likelihood, with an 
appropriately chosen penalty function. The results are usually summarized 
graphically by linking conditionally dependent variables. This provides an 
understanding of how variables, such as the coexpression of genes, are re- 
lated to each other. A challenge in network modeling is to optimize the 
penalized likelihood, subject to the positive-definiteness constraint of the 
precision matrix. Further challenges arise in reducing the biases induced by 
the penalized likelihood method. 
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a p-dimension random vector having a mul- 
tivariate normal distribution with mean vector /x and covariance matrix XI. 
The research on large covariance matrix estimation has surged recently due 
to high-dimensional data, generated by modern technologies such as mi- 
croarray, fMRI and so on. In many applications like gene classifications and 
optimal portfolio allocations it is the precision matrix, denoted hy Cl = 
that is needed and plays an important role. It has a nice interpretation in 
the Gaussian graphical model, as the (z, j)-element of fi is exactly the par- 
tial correlation between the ith. and jth components of X. In the Gaussian 
concentration graphical model with undirected graph {V,E), vertices V cor- 
respond to components of the vector X and edges E = {eij,l <i,j < p} 
indicate the conditional dependence among different components of X. The 
edge Cij between Xi and Xj exists if and only if coij 7^ 0, where iVij is the 
(i, j)-element of fi. Hence, of particular interest is to identify null entries in 
the precision matrix. 

There is significant literature on model selection and parameter estima- 
tion in the Gaussian concentration graphical model. The seminal paper by 
Dempster (1972) discussed the idea of simplifying the covariance structure 
by setting some elements of the precision matrix to zero. Initially the meth- 
ods of precision matrix estimation were based on two steps: (1) identify the 
"correct" model; (2) estimate the parameters for the identified model. One 
standard approach for identifying the model is the greedy stepwise forward- 
selection (or backward-selection) , which is achieved through hypothesis test- 
ing; see Edwards (2000) for an extensive introduction. Drton and Perlman 
(2004) noted that it is not clear whether the stepwise method is valid as a si- 
multaneous testing procedure because its overall error rate is not controlled. 
To improve this stepwise method, Drton and Perlman (2004) proposed a con- 
servative simultaneous confidence interval to select model in a single step. 
Using the least absolute shrinkage and selection operator (LASSO) [Tib- 
shirani (1996)], Meinshausen and Biihlmann (2006) proposed to perform a 
neighborhood selection at each node in the graph. This neighborhood selec- 
tion is computationally very fast and suitable for large-size problems. 

The instability of the aforementioned two-step procedures has been rec- 
ognized by Breiman (1996). Fan and Li (2001) proposed the penalized likeli- 
hood, which can achieve model selection and parameter estimation simulta- 
neously. This penalized likelihood was later studied by d'Aspremont, Baner- 
jee and Ghaoui (2008), Yuan and Lin (2007), Levina, Zhu and Rothman 
(2008), Rothman et al. (2008) and Friedman, Hastie and Tibshirani (2008) 
in the context of precision matrix estimation. Yuan and Lin (2007) solved the 
corresponding optimization problem using the MAXDET algorithm [Van- 
denberghe, Boyd and Wu (1998)] and focused on statistical properties of 
the estimates. d'Aspremont, Banerjee and Ghaoui (2008) proposed two ef- 
ficient first-order numerical algorithms with low memory requirement using 
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semidefinite programming algorithms, which obey the positive-definiteness 
constraint of the precision matrix. Rothman et al. (2008) and Lam and Fan 
(2008) showed that the Probenius norm between the inverse correlation ma- 
trix and its Li penalized likelihood estimator is Op{y^ Slogp/n), where S is 
the number of the nonzero elements of the inverse of the correlation matrix. 
Consequently, the sparse inverse correlation matrix is highly estimable and 
the dimensionality only costs an order of logp, a remarkable improvement 
on the general result of Fan and Peng (2004). Using a coordinate descent 
procedure, Friedman, Hastie and Tibshirani (2008) proposed the graphical 
lasso algorithm to estimate the sparse inverse covariance matrix using the 
LASSO penalty. The graphical lasso algorithm is remarkably fast. 

The Li penalty is convex and leads to a desirable convex optimization 
problem when the log-likelihood function is convex. Recent innovation of 
the LARS algorithm [Efron et al. (2004)] enables computation of the whole 
solution path of the Li penalized regression within 0{'n?p) operations. This 
is a remarkable achievement. However, such an algorithm does not apply to 
the estimation of the precision matrix, whose parameters are subject to a 
positive-definiteness constraint of the matrix. 

It has been shown that the LASSO penalty produces biases even in the 
simple regression setting [Fan and Li (2001)] due to the linear increase of 
the penalty on regression coefficients. To remedy this bias issue, two new 
penalties were proposed recently: one is the nonconcave penalty, such as the 
Smoothly Clipped Absolute Deviation (SCAD) penalty [Fan and Li (2001)], 
and the other is the adaptive LASSO penalty due to Zou (2006). In this 
work we will study precision matrix estimation using these two penalty func- 
tions. Lam and Fan (2008) studied theoretical properties of sparse precision 
matrices estimation via a general penalty function satisfying the proper- 
ties in Fan and Li (2001). The bias presented in the LASSO penalty is also 
demonstrated for sparse precision matrix estimation in Lam and Fan (2008). 
Through the local linear approximation [Zou and Li (2008)] to the noncon- 
cave penalty function, the nonconcave penalized likelihood can be recast 
as a sequence of weighted Li penalized likelihood problems. The weighting 
scheme is governed by the derivative of the penalty function, which depends 
on the magnitude of the current estimated coefficient: the larger magnitude 
the smaller weight. Therefore, the optimization of the penalized likelihood 
with a nonconcave penalty subject to the positive-definiteness constraint of 
Q can be elegantly solved by the efficient algorithm of Friedman, Hastie and 
Tibshirani (2008). In this way, we simultaneously solve the bias issue and 
reduce the computational burden. 

Other recent work on Gaussian concentration graphical models includes 
the following: Li and Qui (2006), who introduced a threshold gradient de- 
scent (TGD) regularization procedure for the sparse precision matrix esti- 
mation; Schafer and Strimmer (2005), who estimated the correlation matrix 
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via regularization with bootstrap variance reduction and used false discovery 
rate multiple testing to select network based on the estimated correlation 
matrix; Bayesian approaches considered in Wong, Carter and Kohn (2003) 
and Dobra et al. (2004); Huang et al. (2006), who reparameterized the co- 
variance matrix through the modified Cholesky decomposition of its inverse 
and transferred covariance matrix estimation to the task of model selection 
and estimation for a sequence of regression models, among others. 

The rest of the paper is organized as follows. Section 2 describes the algo- 
rithm for precision matrix estimation and three types of penalties in detail. 
In Section 3 our methods are applied to two real datasets: telephone call 
center data [Shen and Huang (2005)] and pCR development of breast can- 
cer [Hess et al. (2006)]. Section 4 uses Monte Carlo simulation to compare 
the performance of the three kinds of penalty functions under considera- 
tion. Theoretical properties of the SCAD and adaptive LASSO penalized 
approach are used to justify our methods in Section 5. The Appendix col- 
lects all the technical proofs. 

2. Methods. Suppose xi,X2,...,x„ are from a Gaussian distribution 
with unknown mean vector /jg covariance matrix Sq, denoted as N{fXQ, ^o), 
where Xj = (xji,Xi2, • • • ,Xip)'^ . Denote the sample covariance matrix by S, 
whose {j, A;)-element djk is given by J27=ii^ij ~ ^j){xik — Xk)/n, where Xj = 
Y^^=iXij /n is the sample mean of the jth component. Note that we use n 
instead of n — p in the definition of the sample covariance matrix so that the 
log-likelihood function of the precision matrix can be written in a compact 
format as in (2.1). 

2.1. Penalized likelihood estimation. The precision matrix = I]^^ is 
estimated by maximizing twice the log-likelihood function, which is given 
by 

(2.1) 2/(n) = log detn- Constant, 

where (1], fl) = tr(Sri) denotes the trace of the product matrix 5]ri. When 

n> p, the global maximizer of 1{Q) is given by = 5] 

Denote the generic penalty function on each element by p{-). Under the 
penalized likelihood framework, the estimate of the sparse precision matrix 
is the solution to the following optimization problem: 

(2.2) max log det - (S , fi) - ^ ^ px,^ (|a;,, | ) , 

J = lj = l 

where ujij is the (i, j)-element of matrix ft and Xij is the corresponding 
tuning parameter. 
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The LASSO penalty proposed by Tibshirani (1996) achieves sparsity in 
the regression setting. Essentially, the LASSO penalty uses the Li penalty 
function: p;j(|x|) = A|x|. Friedman, Hastie and Tibshirani (2008) applied the 
LASSO penalty to (2.2) and proposed the graphical lasso algorithm by using 
a coordinate descent procedure, which is remarkably fast. Moreover, this 
algorithm allows a "warm" start, from which we can use the estimate for 
one value of the tuning parameter as the starting point for the next value. 

Numerical examples show that the LASSO penalty can produce a sparse 
estimate of the precision matrix. However, the LASSO penalty increases 
linearly in the magnitude of its argument. As a result, it produces substantial 
biases in the estimates for large regression coefficients. To address this issue. 
Fan and Li (2001) proposed a unified approach via nonconcave penalties. 
They gave necessary conditions for the penalty function to produce sparse 
solutions, to ensure consistency of model selection, and to result in unbiased 
estimates for large coefficients. All three of these desirable properties are 
simultaneously achieved by the SCAD penalty, proposed by Fan (1997). 
Mathematically, the SCAD penalty is symmetric and a quadratic spline on 
[0, oo), whose first order derivative is given by 

(2.3) SCAD',,,(x) = a{/(|x| < A) + i^^^M±/(|^| > a)| 

for X > 0, where A > and a > 2 are two tuning parameters. When a = oo, 

(2.3) corresponds to the Li penalty. Based on an argument of minimizing 
the Bayes risk, Fan and Li (2001) recommended the choice a = 3.7, which 
will be used in all of our numerical examples. Using the SCAD penalty, we 
are seeking to solve the following optimization problem: 

(2.4) maxlogdetn - {±,ft) - SCAD^.^dc^^jl), 

l = l]=l 

where we set Ay = A for convenience. 

Zou (2006) proposed another method to achieve the aforementioned three 
desirable properties simultaneously. It is called the adaptive LASSO penalty, 
and requires a weight for each component. The adaptive LASSO penalty is 
essentially a weighed version of the LASSO penalty with some properly 
chosen weights. For our setting, we define the adaptive weights to be Wij = 
l/\ujij\'^ for some 7 > and any consistent estimate ft = {'^ij)i<i,j<p- Putting 
the adaptive LASSO penalty into (2.2), we get 

p p 

(2.5) maxlogdetfi — — Ay^ Wij\uJij\. 

1=1 j = l 

This method was proposed by Zou (2006) in the regression setting. Accord- 
ing to our numerical experience, estimation results do not differ much for 
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different 7. So, for simplicity, we fix 7 = 0.5 in all our numerical analysis. 
Furthermore, the initial estimate ft can be chosen as the inverse sample co- 
variance matrix for the case p < n or the precision matrix estimate derived 
from the LASSO penalty for the case p>n. Note that the inverse sample 
covariance matrix when p < n may not be consistent if we allow p to grow 
with n. This requirement of a consistent initial estimate is a drawback of 
the adaptive LASSO. In the next subsection we elucidate the connection of 
the nonconcave penalty to the adaptive LASSO penalty. 

2.2. Iterative reweighted penalized likelihood. To reduce the biases for es- 
timating nonzero components. Fan and Li (2001) pointed out a necessary 
condition that the penalty function p\{-) should be nondecreasing over [0, 00) 
while leveling off near the tail. Hence, the penalty function needs to be con- 
cave on [0, 00). At the time, in the absence of the innovative LARS algorithm 
[Efron et al. (2004)], they proposed the LQA algorithm, which conducts the 
optimization iteratively and in each step approximates the SCAD penalty 
via a quadratic function. Hunter and Li (2005) studied the LQA in a more 
general framework in terms of the MM (minorize-maximize) algorithm and 
showed its nice asymptotic properties. The SPICE of Rothman et al. (2008) 
is also based on the LQA algorithm. For both the LQA and MM algo- 
rithms, Friedman, Hastie and Tibshirani (2008) 's graphical lasso algorithm 
cannot directly be applied because the penalty is locally approximated by a 
quadratic function. 

In this work, to take advantage of the graphical lasso algorithm of Fried- 
man, Hastie and Tibshirani (2008), we resort to the local linear approxi- 
mation (LLA), proposed in Zou and Li (2008), which is an improvement 
of the LQA in Fan and Li (2001). In each step, the LLA algorithm locally 
approximates the SCAD penalty by a symmetric linear function. For any 
loq, by the Taylor expansion, we approximate pAd"^!) in a neighborhood of 
I Wo I as follows: 

PxiH) ^ Px{\^o\) +PA(ko|)(|w| - Iwol), 

where p'xi^) = ^Pa('^)i which is nonnegative for lo G [0, 00) due to the mono- 

- (fc) 

tonicity of p\{-) over [0,oo). Denote the k-step solution by ft . Conse- 
quently, at step k, we are optimizing, up to a constant, 

p p 

(2.6) maxlogdetfi — (S,ri) — Wij\uJij\, 

1 = 1 J = l 

where Wij = p'xH'^ij'^ \) anda)|^^ is the (i, j)-element of f2*' "\ The optimization 
problem (2.6) can be easily solved by the graphical lasso algorithm proposed 
by Friedman, Hastie and Tibshirani (2008). 
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At each step, (2.6) is equivalent to a weighted version of the Li-penaUzed 
Ukehhood, leading to a sparse solution. The weighting scheme is governed 
by the derivative of the penalty function and the magnitude of the current 
estimate: the larger magnitude the smaller weight. In Theorem 5.1, we show 
that the penalized likelihood objective function is increasing through each 
iteration in the LLA algorithm. Due to the sparsity in each iteration, Zou 
and Li (2008) studied the one-step LLA algorithm and showed that, asymp- 
totically, the one-step algorithm performs as well as the fully iterative LLA 
algorithm as long as the initial solution is good enough. As a result, we sim- 
ply use the one-step LLA algorithm in this work. In our implementation, the 
initial value is taken as either the inverse sample covariance matrix or the 
LASSO estimate of the precision matrix. The latter is equivalent to using 

(2.6) twice starting with the primitive initial value Cl^^'' = 0, resulting in the 

LASSO estimate fl^ ^ in the first step as SCAD^ ^(0) = A. This also demon- 
strates the flexibility of the SCAD penalty: an element being estimated as 
zero can escape from zero in the next iteration, whereas the adaptive LASSO 
absorbs zeros in each application (the estimate is always sparser than the 
initial value). 

2.3. Tuning parameter selection. As in every regularization problem, the 
tuning parameter A controls the model complexity and has to be tuned 
for each penalty function. In this work we use the popular K-iold cross- 
validation method to do the tuning parameter selection. First divide all 
the samples in the training dataset into K disjoint subgroups, also known 
as folds, and denote the index of subjects in the kth fold by for k = 
1,2, ... ,K. The K-fold cross-validation score is defined as 

CV{X) = f: Llog|n-fc(A)| - (x»)^n_fc(A)x«] , 

k=l \ ieTk / 

where is the size of the kth fold and r2_fc(A) denotes the estimate 
of the precision matrix based on the sample (\J^^iTk)\Tk with A as the 
tuning parameter. Then, we choose A* = argmax;^ CV{X) as the best tuning 
parameter, which is used to obtain the final estimate of the precision matrix 
based on the whole training set UlLi^^fc- Here the maximization of CV{X) 
with respect to A is achieved via a grid search. 

3. Application to real data. In this section we apply our estimation 
scheme to two real datasets and compare the performance of three different 
penalty functions: the LASSO, adaptive LASSO and SCAD. 



8 



J. FAN, Y. FENG AND Y. WU 



3.1. Telephone call center data. In this example our method is applied 
to forecast the call arrival pattern of a telephone call center. The data come 
from one call center in a major U.S. northeastern financial organization, 
containing the information about the arrival time of every call at the service 
queue. Phone calls are recorded from 7:00AM until midnight for each day 
in 2002, except 6 days when the data-collecting equipment was out of order. 
More details about this data can be found in Shen and Huang (2005). 

We take the same data preprocessing as in Huang et al. (2006): (1) divide 
the 17-hour period into 102 10-minute intervals; (2) count the number of 
calls arriving at the service queue during each interval; (3) focus on week- 
days only; (4) use the singular value decomposition to screen out outliers 
that include holidays and days when the recording equipment was faulty. 
Finally, we have observations for 239 days. Denote the data for day i by 
Nj = {Nil, ■ ■ ■ , -^1,102)', for i = 1, . . . , 239, where Nu is the number of calls 
arriving at the call center for the tth 10-minute interval on day i. Define 
yn = yj Nit + 1/4 using the variance stabilization transform for i = 1, . . . , 239 
and t = 1,...,102. We apply the penalized likelihood estimation method 
with three different penalty functions: the LASSO, adaptive LASSO and 
SCAD, to estimate the 102 x 102 precision matrix. As in Huang et al. (2006), 
we use the estimated precision matrix to forecast the number of arrivals 
later in the day using arrival patterns at earlier times of the day. Denote 

yi = {yn, ■ ■ • ,2/1,102)'- Then form the partition y, = (y^^ ,7] )', where y) 
(2) 

and Yi represent the arrival patterns in the early and the later time of day 

i. Here we can take yf ^ = {yn, . . . ,yi,5i)' and yf ^ = (yj,52, ■ • ■ ,yi,i02)'- The 
corresponding partition of the mean and covariance matrix is 

^ (^2) ' (^21 5^22 

With the multivariate normality assumption, the best mean squared error 
forecast of y|^^ using y^^"^^ is 

^(yfVf^) = M2 + 5]2i5]r/(yS'^-/xi), 

which is also the best linear predictor for non-Gaussian data. 

To evaluate the forecasting performance, we split the 239 days into train- 
ing and testing days. The data from the first 205 days, corresponding from 
January to October, is used as the training dataset to estimate the mean 
vector /I and the precision matrix = S~^. The remaining 34 days are used 
for testing. We define the average absolute forecast error (AAFE) by 

239 

AAFEt = — ^ \yit-yit\, 

i=206 
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Table 1 

Average result of call center prediction 





Sample 


LASSO 


Adaptive LASSO SCAD 


Average AAFE 
Nonzero elements in 


1.46 

10,394 


1.39 

2788 


1.34 1.31 
1417 684 



where yu and yu are the observed and the predicted values, respectively. In 
Figure 1 we compare the AAFE performance using the sample covariance 
matrix and the penalized estimates with the LASSO, adaptive LASSO and 
SCAD penalties. In Table 1 we give the average AAFE of the 34 days we 
set aside for testing and also the number of the nonzero elements in the 
precision matrix estimate of the four methods. Here and in all following 
numerical studies, we let the element ujij of the precision matrix be zero if 
\ujij\ < 10~^, because the default threshold for convergence in the graphical 
lasso algorithm is 10""*. We have tried several other thresholding levels, such 
as 10~^ and 10~^, and obtained similar conclusions in both real data analysis 
and simulations. 

Figure 1 and Table 1 show clearly that the forecasts based on the pe- 
nalized estimates are better than that based on the sample covariance ma- 
trix. Among the three penalized estimates, the estimate associated with the 
SCAD penalty performs the best, followed by the adaptive LASSO, and fi- 
nally the LASSO forecast. Moreover, we can see that the sample precision 
matrix is a nonsparse precision matrix and leads to a much more complex 
network than the penalized ones. Comparing to the LASSO, the adaptive 
LASSO leads to a simpler network and the SCAD provides an even simpler 
network, resulting in the smallest forecasting errors. The reason is that the 
SCAD penalty results in the least biased estimate among three penalized 

2.2 - 
2 - 

m - 

iJS- 
14 - 
1.2 - 

1 - 

OB - 

0,6 - 

50 60 70 BO 90 100 110 

Fig. 1. Average absolute forecast error AAAEt against t = 52, . . . , 102 using the sample 
estimate and using three penalties: LASSO, adaptive LASSO and SCAD. 
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schemes. This allows the data to choose a larger penalty parameter A for 
the SCAD penalty and set more spurious zeros to zero. This phenomenon 
will also be observed and demonstrated in the simulation studies. 

3.2. Breast cancer data. As a second example, we focus on selecting gene 
expression profiling as a potential tool to predict the breast cancer patients 
who may achieve pathologic Complete Response (pCR), which is defined 
as no evidence of viable, invasive tumor cells left in the surgical specimen. 
As in Kuerer et al. (1999), pCR after neoadjuvant chemotherapy has been 
described as a strong indicator of survival, justifying its use as a surrogate 
marker of chemosensitivity. Consequently, considerable interest has been 
developed in finding methods to predict which patients will have a pCR to 
preoperative therapy. In this study we use the normalized gene expression 
data of 130 patients with stages I-III breast cancers analyzed by Hess et al. 
(2006). Among the 130 patients, 33 of them are from class 1 (achieved pCR), 
while the other 97 belong to class 2 (did not achieve pCR). 

To evaluate the performance of the penalized precision matrix estimation 
using three different penalties, we randomly divide the data into training 
and testing sets of sizes 109 and 21, respectively, and repeat the whole 
process 100 times. To maintain similar class proportion for the training and 
testing datasets, we use a stratified sampling: each time we randomly select 
5 subjects from class 1 and 16 subjects from class 2 (both are roughly 1/6 
of their corresponding total class subjects) and these 21 subjects make up 
the testing set; the remaining will be used as the training set. From each 
training data, we first perform a two-sample i-test between the two groups 
and select the most significant 110 genes that have the smallest p-values. 
In this case, the dimensionality p = 110 is slightly larger than the sample 
size n = 109 for training datasets in our classification study. Due to the 
noise accumulation demonstrated in Fan and Fan (2008), p= 110 may be 
larger than needed for optimal classification, but allows us to examine the 
performance when p> n. Second, we perform a gene- wise standardization by 
dividing the data with the corresponding standard deviation, estimated from 
the training dataset. Finally, we estimate the precision matrix and consider 
the linear discriminant analysis (LDA). LDA assumes that the normalized 
gene expression data in class-A; is normally distributed as A^(/2^, XI) with the 
same covariance matrix, where k = 1,2. The linear discriminant scores are 
as follows: 



where TCk = rik/n is the proportion of the number of observations in the 
training data belonging to the class k, and the classification rule is given by 
arg max^ Jfc (x) . Details for LDA can be found in Mardia, Kent and Bibby 




^-1 



1 



NETWORK EXPLORATION VIA THE ADAPTIVE LASSO AND SCAD PENALTIES 



(1979). Based on each training dataset, we can estimate the with-in class 
mean vectors by 

LLu = — Xj forfc = l,2 

iSclass-fc 

and precision matrix using the penaUzed loghkehhood method with 
three different penalty functions: the LASSO, adaptive LASSO and SCAD. 
Tuning parameters in different methods are chosen via six-fold cross-validation 
based on the training data. Note that the sample size n is smaller than the 
dimensionality p in this case. As a result, the sample covariance matrix is 
degenerate and cannot be used in the LDA. 

To compare the prediction performance, we used specificity, sensitivity 
and also Matthews Correlation Coefficient (MCC). They are defined as fol- 
lows: 

TN TP 
Specificity = ^^^pp , Sensitivity = ^p^^^^ , 

^^CC TPxTN-FPxFN 

^(TP + FP)(TP + FN)(TN + FP)(TN + FN) ' 

where TP, TN, FP and FN are the numbers of true positives, true negatives, 
false positives and false negatives, respectively. MCC is widely used in ma- 
chine learning as a measure of the quality of binary classifiers. It takes true 
and false, positives and negatives, into account and is generally regarded 
as a balanced measure, which can be used even if the classes are of very 
different sizes. The larger the MCC is, the better the classification is. More 
details can be found in Bladi et al. (2000). Means and standard deviations (in 
parentheses) of the specificity, sensitivity, MCC and the number of nonzero 
elements in over 100 repetitions are reported in Table 2. To visually in- 
terpret the gene network derived by our penalized likelihood methods, we 
applied our whole estimation scheme to all the 130 datasets: (1) use a two 
sample t-test to select 110 genes; (2) use the penalized likelihood estimation 
scheme to derive the precision matrix estimates. Next we try to show the 
corresponding gene networks derived by using three different penalties. To 
gain a better view, we only plot the gene networks of the 60 genes with the 
smallest p- values among the 110 genes in Figure 2. 

From the table, we can see that the adaptive LASSO and SCAD improve 
over the LASSO in terms of the specificity and MCC, while all three penalties 
give similar sensitivity. Furthermore, when we look at the number of nonzero 
elements of the precision matrix estimates, using three different penalties, we 
can see again that, by using the adaptive LASSO and SCAD penalties, we 
can get much simpler models which are often more desirable. From Figure 
2, it is clear that, compared with the network derived using the LASSO 
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Table 2 

Result of pCR classification over 100 repetitions 





Specificity 


Sensitivity 


MCC 


Nonzero elements in ! 


LASSO 


0.768 (0.096) 


0.630 (0.213) 


0.366 (0.176) 


3923 (18) 


Adaptive LASSO 


0.787 (0.093) 


0.622 (0.218) 


0.381 (0.183) 


1233 (8) 


SCAD 


0.794 (0.098) 


0.634 (0.220) 


0.402 (0.196) 


674 (12) 



penalty, the ones derived using the adaptive LASSO and SCAD penalties 
both show some small clusters, indicating block diagonal precision matrices. 
This interesting phenomenon is worth further study. 

4. Monte Carlo simulation. In this section we use simulations to exam- 
ine the performance of the penalized log-likelihood approach proposed in 
Section 2, to estimate the precision matrix with different penalties. In the 
first three examples, we set the dimensionality p = 30. Three different data 
generating settings for the 30 x 30 precision matrix Q, are considered in 
Examples 4.1, 4.2 and 4.3. In Examples 4.4 and 4.5 we consider the cor- 
responding high dimensional case with p = 200 for Examples 4.1 and 4.2, 
respectively. In each example we first generate a true precision matrix Q, 
which will be fixed for the whole example. Next we generate a dataset of 
n = 120 i.i.d. random vectors distributed as N{Q,Q,~^). For each simulated 
dataset and each penalty a 6-fold cross-validation scheme is used to tune 
the regularization parameter as discussed in Section 2.3. 

To compare the performance of different estimators corresponding to the 
three penalty functions under consideration, the LASSO, adaptive LASSO 
and SCAD, we use two types of loss functions: the entropy loss and the 
quadratic loss [Lin and Perlman (1985)] defined by 

lossi(n,n) =trn"^n-log|n^^n|-n and \oss2{n,n) =iT{n-^n- if , 




Fig. 2. Gene networks derived using three penalties: the LASSO (left panel), the adaptive 
LASSO (middle panel) and the SCAD (right panel). 
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Table 3 
Simulation result of Example 4-1 

lossi I0SS2 zeroi zero2 percj^ perc^ 

LASSO 1.64 (0.15) 11.06(6.64) 248.48 (60.02) 0.02 (0.20) 30.60 (7.39) 0.02 (0.23) 

Adaptive LASSO 1.14 (0.16) 7.44(4.45) 42.58 (28.71) 0.16 (0.56) 5.24 (3.54) 0.18 (0.62) 
SCAD 0.83 (0.24) 2.49(3.78) 76.89 (23.58) 0.18 (0.58) 9.47 (2.90) 0.20 (0.65) 



respectively, where ft is an estimate of the true precision matrix CI. To eval- 
uate the performance of the three different penalties concerning sparsity, we 
report two types of errors regarding zero elements: zeroi means the number 
of type-I errors (i.e., the true entry of the precision matrix is nonzero but 
the corresponding estimate is zero) and zero2 the number of type-II errors 
(i.e., the true entry is zero but its estimator is nonzero). Ideally, we would 
like to have small zeroi and zero2. We also calculate the relative error per- 
centages: perci = 100 X zeroi/Ni and perc2 = 100 x zero2/N2, where Ni and 
N2 are the number of zeros and nonzeros of the true precision matrix re- 
spectively. Results of lossi, I0SS2, zeroi, zero2, perc;^ and perc2 over the 100 
simulations are reported for each simulation example. We will summarize 
the performance at the end of this section. 

Example 4.1 [Tridiagonal case (n = 120, p = 30)]. In this first example 
we consider the case with a tridiagonal precision matrix, which is associ- 
ated with the autoregressive process of order one [i.e., AR(1) covariance 
structure]. In this case the covariance matrix S is a px p matrix with 
element aij = exp {—a\si — Sj\), where si < S2 < • • • < for some a > 0. 
Here, we choose 

Si - Si-i ' Unif(0.5, 1), i = 2,...,p. 

The precision matrix is set as = The performance of three penalties 
over 100 repetitions is reported in Table 3, which presents the means of 
zeroi, zero2, lossi, I0SS2, percj^ and perc2 with their corresponding standard 
errors in parentheses. 

It is not realistic to plot the individual sparsity pattern of the estimates for 
all the repetitions. Instead we plot the average sparsity pattern, the relative 
frequency matrix, for each penalty. More specifically, the (i, j)-element of 
the relative frequency matrix is defined as the relative frequency of nonzero 
estimates of the (i, j)-element of the precision matrix ft throughout the 100 
repetitions. For example, the diagonal elements ua have estimates that are 
always nonzero and, as a result, their corresponding relative frequencies are 
always one. We plot this average sparsity pattern using different penalties 
in panels B, C and D of Figure 3. The true precision matrix is given in 
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panel A of Figure 3. We render this kind of sparsity pattern graph using the 
gray-scale version of "imagesc" function in Matlab. 

Example 4.2 [General case (n = 120, p = 30)]. In the second example 
we consider a general sparse precision matrix and use the data generating 
scheme of Li and Gui (2006). More specifically, we generate p points ran- 
domly on the unit square and calculate all their pairwise distances. For each 
point, define its k nearest neighbors as those with k smallest distances to this 
point. By choosing different number k, we can obtain graphs for this model 
with different degrees of sparsity. For each "edge," the corresponding ele- 
ment in the precision matrix is generated uniformly over [—1, —0.5] U [0.5, 1]. 
The value of the ith diagonal entry is set as a multiple of the sum of the ab- 
solute values of the ith row elements excluding the diagonal entry. Here we 
chose a multiple of 2 to ensure that the obtained precision matrix is positive 
definite. Finally, each row is divided by the corresponding diagonal element 
so that the final precision matrix has diagonal elements of ones. Numerical 
results are summarized in Figure 4 and Table 4. 

Example 4.3 [Exponential decay matrix (n = 120, p = 30)]. In this 
example we consider the case that no element of the precision matrix is 
exactly zero. The (z, j)-element of the true precision matrix is given by 
LOij = exp(— 2|i — j|), which can be extremely small when \i — j\ is large. 
Numerical results over 100 repetitions in the same format as Example 4.1 
are reported in Table 5 and Figure 5. Notice in Figure 5, panel A shows the 
sparsity pattern, since we apply the threshold to the true precision matrix 
as to the three estimates. 

Example 4.4 [High dimensional tridiagonal case (n = 120, p = 200)]. 
The previous three examples belong to the classical setting with dimen- 
sionality p smaller than the sample size n. Next we investigate the high 

true LASSO adaptive U\SSO SCAD 




Fig. 3. For the 100 samples in Example 4-1, the average sparsity pattern recovery for 
the LASSO, adaptive LASSO and SCAD penalties are plotted in panels B, C and D, 
respectively, to compare with the true sparsity pattern (panel A). 
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Fig. 4. For the 100 samples in Example 4-2, the average sparsity pattern recovery for 
the LASSO, adaptive LASSO and SCAD penalties are plotted in panels B, C and D, 
respectively, to compare with the true sparsity pattern (panel A). 

dimensional case with p> n. In this example we keep all the data genera- 
tion process of Example 4.1 except that we increase the dimensionality p to 
200. The simulation result is reported in Table 6 and Figure 6. 

Example 4.5 [High dimensional general case (n = 120, p = 200)]. In 
this example we use the same setting as that of Example 4.2 but increase p 
to 200 as we did in Example 4.4. The simulation results are summarized in 
Table 7 and Figure 7. 

Throughout all these different examples, we can see that the LASSO 
penalty, in general, produces more nonzero elements in the estimated preci- 
sion matrix than the adaptive LASSO and SCAD penalties. This is due to 
the bias inherited in the LASSO penalty that prevents data from choosing 
a large regularization parameter. The adaptive LASSO produces the most 
sparse pattern due to the specific choice of the initial estimate. Based on 

Table 4 
Simulation result of Example 4-2 



lossi I0SS2 zeroi zero2 percj^ perc2 

LASSO 1.11 (0.11)9.05 (4.35)125.66 (39.79)34.62 (8.28)15.99 (5.06)30.37 (7.26) 

Adaptive LASSOl. 14 (0.10)2.99 (2.17) 11.28 (10.35)66.80 (8.53) 1.44 (1.32)58.60 (7.48) 
SCAD 1.04 (0.10)0.81 (1.12) 62.72 (26.79)45.96 (9.35) 7.98 (3.41)40.32 (8.20) 



Table 5 
Simulation result of Example 4-3 



lossi I0SS2 zeroi zero2 percj^ perc2 



LASSO 0.88 (0.09)10.72 (4.93)88.54 (34.33)126.94 (12.57)12.61 (4.89)64.11 (6.35) 

Adaptive LASSO0.81 (0.07) 4.25 (2.93) 5.08 (6.71) 161.62 (6.16) 0.72 (0.96)81.63 (3.11) 
SCAD 0.75 (0.08) 0.77 (1.07)35.60 (23.03)145.28 (12.09) 5.07 (3.28)73.37 (6.11) 
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Fig. 5. For the 100 samples in Example 4-3, the average sparsity pattern recovery for 
the LASSO, adaptive LASSO and SCAD penalties are plotted in panels B, C and D, 
respectively, to compare with the true sparsity pattern (panel A). 

Table 6 
Simulation result of Example 4-4 



lossi 



10SS2 



zeroi 



zero2 



perCj^ 



perca 



LASSO 19.31 (0.43)1065.37 (82.56) 4009.75 (117.60)0.64 (1.24) 10.18 (0.30)0.11 (0.21) 
Adaptive 

LASSO 12.44 (0.92) 664.46 (129.35) 269.86 (61.97) 7.76 (4.11) 0.68 (0.16)1.30 (0.69) 

SCAD 10.55 (0.48) 288.26 (62.34) 3478.76 (106.73) 1.10 (1.67) 8.83 (0.27)0.18 (0.28) 



true LASSO adaptive LASSO SCAD 




93 100 150 200 50 100 1£0 50 100 150 200 SO 100 150 300 



Fig. 6. For the 100 samples in Example 4-4i the average sparsity pattern recovery for 
the LASSO, adaptive LASSO and SCAD penalties are plotted in panels B, C and D, 
respectively, to compare with the true sparsity pattern (panel A). 



Tables 3-7, improvements are observed for the adaptive LASSO and SCAD 
penalties over the LASSO penalty in terms of the two types of loss functions 
(especially the second type) and as well as the two types of errors regarding 
zero elements. 

5. Theoretical properties. In this section we provide some theoretical 
justifications. We first prove that the penalized log-likelihood function is in- 
creasing in each iteration using the LLA algorithm. The oracle properties of 
the SCAD and adaptive LASSO penalties will be established in our context. 

Without loss of generality, we may consider the case that the random 
vector is normally distributed with mean zero, that is, X ~ A^(0, Sq), where 



NETWORK EXPLORATION VIA THE ADAPTIVE LASSO AND SCAD PENALTIES 

Table 7 
Simulation result of Example 4-5 



lossi I0SS2 zeroi zeroa percj^ perc2 

LASSO 8.24 (0.27)1082.61 (112.61)796.16 (264.66)255.22 (13.57)2.02 (0.67)46.74 (2.49) 
Adaptive 

LASSO6.50 (0.21) 316.95 (53.99) 6.58 (4.92) 336.24 (4.51) 0.02 (0.01)61.58 (0.83) 
SCAD 6.65 (0.40) 32.33 (23.06) 224.98 (247.45)298.12 (21.24)0.57 (0.63)54.60 (3.89) 



is a vector of zeros and Sq is the true unknown p x p covariance matrix. 
The corresponding true precision matrix is fig = ^0 ^- sample consists 
of n independent and identically distributed observations xi, X2, . . . , x„. In 
this case the sample covariance matrix is defined by 

n 

(5.1) ± = J2^,^J/n. 

i=l 

Note here p is assumed to be fixed and we study asymptotic properties of 
our penalized estimates with the SCAD and adaptive LASSO penalties as 
the sample size n — > 00. 

Theorem 5.1. For a differentiable concave penalty function px{-) on 
[0,00], the penalized log-likelihood function is increasing through each itera- 
tion in the LLA approximation. 

See the Appendix for the proof of Theorem 5.1. 

Theorem 5.2. For n i.i.d. observations xi,X2,...,x„ from A^(0,So), 
the optimizer of the SCAD penalized log-likelihood function (2.4) with 
sample covariance given by (5.1) has the oracle property in the sense of Fan 
and Li (2001 ), when A ^ and ^/nX — > co as n — > cx). Namely: 



true LASSO adaplive LASSO SCAD 




Fig. 7. For the 100 samples in Example 4-5, the average sparsity pattern recovery for 
the LASSO, adaptive LASSO and SCAD penalties are plotted in panels B, C and D, 
respectively, to compare with the true sparsity pattern (panel A). 
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(1) Asymptotically, the estimate ft has the same sparsity pattern as the 
true precision matrix CIq. 

(2) The nonzero entries of the ft are y/n- consistent and asymptotically 
normal. 

See the Appendix for the proof of Theorem 5.2. 

Theorem 5.3. When y/nX = Op{l) and X^/naJi^ — > oo as n ^ oo, the 
oracle property also holds for the adaptive LASSO penalty with weights 
specified by Wij = l/\wij\'^ for some 7 > and any On-consistent estimator 
^ = {'^ij)i<i,j<p, that is, a„(fi - CIq) = Op{l). 

The proof of Theorem 5.3 can be found in the supplemental article [Fan, 
Feng and Wu (2008)]. 

APPENDIX 



Proof of Theorem 5.1. Define 



Qx{n) = logdetfl - - PA (kij I ) 

i=ii=i 



and 



i=ij=i 

" (^) 

Then, given estimate ft , we have 

(A.l) n^''^^^ = arg max ^xi^l^^''^)- 

Our goal is to prove that Qx{ft ) > Qx{^ )• At the /cth-step, consider 

n n 

= ^^{p,(l^f i)-pa(K-I)}. 
i=ii=i 

By the concavity of p\{-) over [0,oo), we have p\{\Cjlj'\) + P)^{\uJlj'\){\uJij\ - 

I'^Jf I) -Px{\^ij\) > 0- Then, we have Qx{fl) > <l> xi^l]^^''^ ) ■ Finally, by notic- 

" (fc) ~ (fc) » (fc) 

ing that Q\{^ ) = '&a(^^ \^ ) and using (A.l), we have 

ftin'""') > > <i.,(n'"|n"-') = Q,(n'"), 
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as desired. □ 



Proof of Theorem 5.2. It is enough to check conditions (A)-(C) of 
Fan and Li (2001). Since Xj are i.i.d. from A^(0, XIq), the probabihty density 
function for X is given by /(x, fio) = exp(— x"^riox/2)-\/detrio/(27r). The 
log-hkeUhood function of the precision matrix is given by 

n ^ 

^-(logdetn-xffix,) 

i=i ^ 

= ^(^logdetn-lf]xfnxi^ 

Ti 

= -(logdetn-tr(ns;)), 

up to a constant, where tr(-) denotes the trace operator. This justifies the 
log-hkehhood function given in Section 2 as weU. 
Notice that 



/ 9iog/(x,n) 

V duj. 



19 r 
= i^Eii- — (logdetSl -x^ fix) \n=no 

f2=no ^ (^^ij 



which reduces to (— I)*"'"-' detrio,-ij/(detrio) — co^jj wheni/j and ^(det fio,-ii/ 
(det Qq) 

~ ^o,ii) when i — j, where rig,— ij denotes the matrix after removing 
the ith row and jth column from CIq and cro,jj is the (i, j)-element of the co- 
variance matrix "Sq. Noting that fig = "^o^ , we have (—I)*"'"-' det rig, -ij /(det fig) 
o'o,ij = for i / j and |(det f2g^_jj/(det Slg) — ao^u) = when i = j, as we 

have desired. That is, -Enol^^^^^^T^^) 1^2=5^0 — Similarly, we can show that 

^no(g£-log/(x,n)^log/(x,n))|n=no=i^n„(-a^log/(x,n))|n=i2o- 
So condition (A) is satisfied by noting that /(x, fi) has a common support 
and the model is identifiable. 

To prove condition (B), it is sufficient to prove that the log-det function 
is concave. More explicitly, for the log-det function h{ft) = logdetri, we 
can verify concavity by considering an arbitrary line, given by fl = Z + tV, 
where Z,V £ Sp. We define g{t) = h{Z + tV) , and restrict g to the interval of 
values of t for which Z + tV £ Sp. Without loss of generality, we can assume 
t = is inside the interval, that is, Z £ Sp. We have 

g{t) = log det{Z + tV) 

= logdet(zV2(/ + tz-i/Vz-i/2)Zi/2) 
p 

= ^ log(l + tXi) + log det Z, 
1=1 
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where Ai, . . . , Ap are the eigenvalues of Z ^/"^VZ Therefore, we have 

Since g"{t) < 0, we conchide that h is concave. 

Condition (C) is easy to satisfy because the third order derivative does 
not involve x. □ 
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SUPPLEMENTARY MATERIAL 

Proof of Theorem 5.3 (DOT: 10.1214/08-AOAS215SUPP; .pdf). We gave 
a detailed proof of the oracle properties for the adaptive lasso penalty as 
stated in Theorem 5.3. 
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